Xiangyu Zhang

The WER Trap: Shattering the Illusion of Unified Tokens in Speech Language Models

May 28, 2026, via arXiv

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

May 27, 2026, via arXiv

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

May 26, 2026, via arXiv

StepAudio 2.5 Technical Report

May 22, 2026, via arXiv

Vision Foundation Models as Generalist Tokenizers for Image Generation

May 18, 2026, via arXiv

Step-Audio-R1.5 Technical Report

Apr 28, 2026, via arXiv

Spike-NVPT: Learning Robust Visual Prompts via Bio-Inspired Temporal Filtering and Discretization

Apr 20, 2026, via arXiv

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Apr 15, 2026, via arXiv

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

Apr 13, 2026, via arXiv

When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

Mar 17, 2026, via arXiv